Yield improvements for three-dimensionally stacked neural network accelerators

ABSTRACT

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for three-dimensionally stacked neural network accelerators. In one aspect, a method includes obtaining data specifying that a tile from a plurality of tiles in a three-dimensionally stacked neural network accelerator is a faulty tile. The three-dimensionally stacked neural network accelerator includes a plurality of neural network dies, each neural network die including a respective plurality of tiles, each tile having input and output connections. The three-dimensionally stacked neural network accelerator is configured to process inputs by routing the input through each of the plurality of tiles according to a dataflow configuration. The method further includes modifying the dataflow configuration to route an output of a tile before the faulty tile in the dataflow configuration to an input connection of a tile that is positioned above or below the faulty tile on a different neural network die than the faulty tile.

CROSS REFERENCE TO RELATED APPLICATION

This application is a continuation of U.S. application Ser. No. 15/685,672, filed on Aug. 24, 2017. The disclosure of the prior application is considered part of and is incorporated by reference in the disclosure of this application.

BACKGROUND

This specification generally relates to three-dimensionally stacked neural network accelerators.

Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters.

Typically, neural network processing systems use general purpose graphics processing units, field-programmable gate arrays, application-specific integrated circuits, and other hardware of the like to implement the neural network.

SUMMARY

In general, one innovative aspect of the subject matter described in this specification can be embodied in methods that include the actions of obtaining data specifying that a tile from a plurality of tiles in a three-dimensionally stacked neural network accelerator is a faulty tile. The three-dimensionally stacked neural network accelerator includes a plurality of neural network dies stacked on top of each other, each neural network die including a respective plurality of tiles, each tile having input and output connections that route data into and out of the tile. The three-dimensionally stacked neural network accelerator is configured to process inputs by routing the input through each of the plurality of tiles according to a dataflow configuration. The methods further include modifying the dataflow configuration to route an output of a tile before the faulty tile in the dataflow configuration to an input connection of a tile that is positioned above or below the faulty tile on a different neural network die than the faulty tile.

Other embodiments of this aspect include corresponding systems, apparatus, and computer programs, configured to perform the actions of the methods, encoded on computer storage devices.

A neural network accelerator can be used to accelerate the computation of a neural network, i.e., the processing of an input using the neural network to generate an output or the training of the neural network to adjust the values of the parameters of the neural network. Three-dimensionally stacked neural network accelerators can be constructed with vertical interconnects that communicatively couple vertically adjacent dies. Three-dimensionally stacked neural network accelerators are cheaper to fabricate and more compact than traditional neural network accelerators. However, traditional mechanisms for fabricating three-dimensionally stacked neural network accelerators make it unlikely that a given three-dimensionally stacked neural network accelerator is fabricated with only functional dies, i.e., is fabricated without one or more dies being faulty.

Particular embodiments of the subject matter described in this specification can be implemented so as to realize one or more of the following advantages. For three-dimensionally stacked neural network accelerators, the more computing tiles that are stacked on top of each other, the higher the probability that the whole stack is faulty. This occurs because if one computing tile is faulty, this may render the entire three-dimensionally stacked neural network accelerator inoperable, resulting in potentially poor yield of operable three-dimensionally stacked neural network accelerators. However, modifying dataflow configurations for three-dimensionally stacked neural network accelerators increases functionality for the three-dimensionally stacked accelerators. For example, modifying the dataflow configuration allows the three-dimensionally stacked neural network accelerator to still be usable even if one or more computing tiles are faulty.

Attempting to use the faulty tiles will render the entire three-dimensionally stacked neural network accelerator useless. Therefore, the faulty tiles are bypassed to ensure functionality of the remaining portions of the three-dimensionally stacked neural network accelerator. Modifying the dataflow configuration for a three-dimensionally stacked neural network accelerator includes redirecting outputs of given computing tiles to inputs of computing tiles on dies above or below the given computing tiles. This enables a more modular flow of data throughout the three-dimensionally stacked neural network accelerators. In addition, modifying the dataflow configuration will improve the yield of operable three-dimensionally stacked neural network accelerators because having one or more faulty computing tiles will not render the entire accelerator inoperable. Three-dimensionally stacked neural network accelerator yields decrease as the total chip area increases. Modifying the dataflow configuration to include transmitting data between vertically adjacent dies increases yields for three-dimensionally stacked neural network accelerators.

The details of one or more implementations of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other potential features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1A-B are block diagrams of an example three-dimensionally stacked neural network accelerator.

FIG. 2 is a block diagram of a computing tile.

FIG. 3 is an example block diagram of a bipartite graph.

FIG. 4 illustrates an example neural network dataflow configuration.

FIG. 5 illustrates an example dataflow configuration for a three-dimensionally stacked neural network accelerator.

FIG. 6 is a flowchart of an example process for modifying a dataflow configuration for tiles within a three-dimensionally stacked neural network accelerator.

Like reference numbers and designations in the various drawings indicate like elements.

DETAILED DESCRIPTION

The subject matter described in this specification relates to a hardware computing system including multiple computing units configured to accelerate workloads of a neural network. Each computing unit of the hardware computing system is self-contained and can independently execute computations required by a portion, e.g., a given layer, of a multi-layer neural network.

A neural network accelerator can be used to accelerate the computation of a neural network, i.e., the processing of an input using the neural network to generate an output or the training of the neural network to adjust the values of the parameters of the neural network. The neural network accelerator has data inputs and outputs. The neural network accelerator receives data, processes the data, and outputs the processed data. A three-dimensionally stacked neural network accelerator uses a plurality of neural network dies stacked on top of each other to increase computing power for a neural network accelerator. Each neural network accelerator die includes a plurality of computing tiles. Each computing tile also has an input, an output, and processes data using a computing tile processor.

Tiles are connected together in sequence and the neural network accelerator directs data between each of the tiles according to a dataflow configuration. For example, data is received at a first computing tile, a computation is executed, and the first tile's output is transmitted to the input of a second computing tile, which also completes a computation. In some instances, a computing tile may be faulty (i.e., not functioning as intended) after the accelerator has been manufactured. For example, the tile may have non-functioning on-die cache memory, damaged intra-die connections, an incorrect clock, and so on, which may render the entire neural network accelerator inoperable.

However, according to the systems and methods described herein, a faulty computing tile is bypassed during computation by the neural network accelerator, i.e., no output is transmitted to the faulty computing tile's input. Instead of routing an output of one tile to the input of the faulty tile, the output is routed to an input of a different computing tile that is on a die above or below the die that houses the faulty tile. After the different computing tile executes its computation, the different computing tile sends its output to an input of another computing tile, e.g., a computing tile that is housed on the same neural network die that houses the faulty computing tile. This bypasses the faulty tile and enables the use of the three-dimensionally stacked neural network accelerator even with one or more faulty tiles.

FIGS. 1A-B are block diagrams of an example three-dimensionally stacked neural network accelerator 100. A neural network accelerator 100 is an integrated circuit that is designed to accelerate the computation of a neural network, i.e., the processing of an input using the neural network to generate an output or the training of the neural network. A three-dimensionally stacked neural network accelerator 100 includes a plurality of neural network accelerator dies 102 a-e stacked on top of each other creating a large-scale neural network accelerator.

Typically, a neural network accelerator wafer 102 a-e is created using semiconductor material (e.g., silicon, gallium arsenide, indium arsenide, etc.). Neural network accelerator dies 102 a-e are manufactured using traditional semiconductor wafer fabrication techniques. Each of the neural network accelerator dies 102 a-e includes a plurality of computing tiles, hereafter referred to as tiles 104 a-p, arranged on a surface of the die 102 a-e.

Each tile 104 a-p is an individual computing unit and the tiles 104 a-p collectively perform processing that accelerates computations across the three-dimensionally stacked neural network accelerator. Generally, a tile 104 a-p is a self-contained computational component configured to execute all or portions of the processing for a neural network computation. Example architectures for the tiles 104 a-p are described in U.S. patent application Ser. No. 15/335,769, which is incorporated herein by reference.

The tiles on each die are connected according to a static interconnect system. For example, each tile is communicatively coupled, using inductive coupling, through-silicon vias (TSVs), or wired connections, to one or more adjacent tiles (including adjoining tiles, tiles above each other, tiles below each other, etc.). The configuration of a static interconnect system will be described in more detail below in connection with FIG. 3.

FIG. 2 is a block diagram of a computing tile 104 a. Each computing tile 104 a includes a processing element 202, a switch 204, inductive coils 206 a and 206 b, and a processing element bypass 208. The components illustrated in FIG. 2 are not drawn to scale. Typically, the processing element 202 consumes most of the tile area and is more likely to have a defect or fault. Defect density of computing tiles is uniform, thus the large area of the processing element 202 makes the processing element 202 the component most likely to fail. The processing element 202 receives an input from an output of the switch 204. The processing element 202 executes computations for the neural network accelerator. The output of the processing element 202 can be transmitted to the input of a switch 204 of a different tile 104 a-p.

Some or all of the tiles 104 a-p include inductive coils 206 a, b. Although FIG. 2 illustrates a tile 104 a with two inductive coils 206 a, b, in some implementations the computing tile can include between 10 and 1000 coils. The inductive coils enable inductive coupling of vertically adjacent tiles 104 a-p using magnetic fields between the tiles 104 a-p. The inductive coupling of tiles 104 a-p enables tiles 104 a-p on different dies to communicate using near field wireless communication. Each tile 104 a-p communicates with adjacent tiles above or below the tile using the inductive coupling. For example, the first tile 104 a on the top die 102 a can transmit data to and receive data from the first tile 104 a on the second die 102 b located under the top die 102 a.

Typically, tiles 104 a-p communicate with adjacent tiles directly above or below themselves using inductive coupling. However, in some implementations, tiles within a three-dimensionally stacked neural network accelerator can communicate with tiles on any die 102 a-e within the three-dimensionally stacked neural network accelerator 100. For example, for a three-dimensionally stacked neural network accelerator with 7 stacked dies, a particular tile positioned on a given die can vertically communicate with tiles above or below the particular tile on any of the other 6 stacked dies. One way of implementing the near-field communication technology is described in “ThruChip Interface for 3D system integration” by Tadahiro Kuroda at http://ieeexplore.ieee.org/document/5496681/.

The inductive coils 206 a, b included in the tile 104 a are a receiver coil and a transmitter coil. The receiver and transmitter coils can each include a plurality of inductive coils coupled together to transmit data to and receive data from vertically adjacent tiles. The plurality of inductive coils are coupled together to achieve the desired bandwidth and magnetic fields to communicate data between vertically adjacent tiles. Either of the inductive coils 206 a, b can be selected to be the receiver coil or the transmitter coil according to the determined dataflow configuration. The receiver and transmitter coils respectively and independently receive or transmit data between tiles 104 a-p on different dies. The coils each produce a magnetic field and using the magnetic field the tiles communicate using near field communication. For example, the magnetic field belonging to a transmitter coil on a given tile 104 a-p is coupled to a magnetic field belonging to a receiver coil of a different tile 104 a-p. The two coils transfer data by using the magnetic field created by the inductive coupling as a carrier signal.

The inductive coils 206 a, b can each be selectively chosen as the receiver coil or the transmitter coil. The inductive coils 206 a, b are determined to be a receiver coil or a transmitter coil based on the configuration of the inputs and the outputs of the switch 204. For example, the inductive coil that receives an output of the switch is the transmitter coil as it will transmit the received data to a vertically adjacent tile. The inductive coil that transmits data to an input of the switch is the receiver coil because this coil passes on the data it receives from a vertically adjacent coil. Modifying the variable inputs and outputs that are defined by the configuration of the switch enables the static interconnect configuration to be changed to determine various dataflow configurations.

In some implementations, each tile can also communicate with vertically adjacent tiles using through-silicon vias (TSVs). A TSV is a vertical electrical connection that passes through the die. Outputs of processing elements can be passed to the input of a switch 204 belonging to a tile on a vertically adjacent die using TSVs.

Each tile includes a switch 204 that is coupled to a plurality of inputs and includes a plurality of outputs. In some implementations and as shown in FIG. 2, the switch has four inputs (i.e., inputs A, B, C, and D) and four outputs (i.e., outputs W, X, Y, and Z). The switch 204 can direct any of the plurality of inputs received at the switch to any of the plurality of switch outputs. In this instance, the input A can be a processing element bypass of an adjacent tile. Input B can be the output of the processing element 202 of an adjacent tile. Either inductive coil can be selected as the receiver coil or the transmitter coil. Whichever inductive coil is configured as the transmitter coil sends data to vertically adjacent tiles and whichever coil is configured as the receiver coil receives data from vertically adjacent tiles. In the instance where inductive coil A 206 a is the receiver coil, input C can be data received at inductive coil A 206 a and transmitted to the switch 204. Alternatively, and based on the selected dataflow configuration, in the instance where inductive coil B 206 b is the receiver coil, input D can be data received at inductive coil B 206 b and transmitted to the switch 204.

The switch 204 can transmit data from any of the inputs, inputs A, B, C, or D, to any of the outputs, outputs W, X, Y, and Z. In this instance, output W can direct data to a processing element bypass. The processing element bypass provides a data transmission path that bypasses the processing element 202. Output W enables data to be transmitted out of the processing tile without transmitting the data to the processing element 202. In this instance, the processing element 202 could be faulty. Therefore, the processing element 202 is bypassed, using the processing element bypass, to ensure continuity of the ring bus. Output X of the switch 204 is coupled to the input of the processing element 202. Thus, data transmitted by output X of the switch is transmitted to the processing element 202 such that the processing element 202 uses the data for neural network accelerator computations. Outputs Y and Z are coupled to the inputs of inductive coils A and B 206 a, b, respectively. Outputs Y and Z can be selectively chosen to direct data that is transmitted to inductive coils of vertically adjacent tiles.

In some implementations, tiles 104 a-p communicating with tiles 104 a-p on different dies 102 a-e use the inductive coupling of the inductive coils to transmit input data from tiles on different dies 102 a-e. For example, in the instance where the first tile 104 a on the top die 102 a communicates with the first tile 104 a on the die 102 b below the top die 102 a, the switch of the first tile 104 a on the top die 102 a is coupled to output Y. Output Y transmits data to the transmitting inductive coil, which directs the data to the receiving inductive coil of the first tile 104 a on die 102 b below. The receiving inductive coil directs the data to the switch 204 of the first tile 104 a on die 102 b below. The switch 204 can direct the data to any of the available switch outputs. In other implementations, tiles 104 a-p communicating with tiles 104 a-p on different dies 102 a-e use through-silicon via technology to transmit the data.

In some implementations, the switch 204 can include one or more multiplexers and one or more demultiplexers. A multiplexer includes two or more selectable inputs and one output. The demultiplexer includes one input and two or more selectable outputs. Accordingly, and in this instance, the switch uses the multiplexer to receive any of the four inputs, and the output of the multiplexer is coupled to the input of the demultiplexer. The outputs of the demultiplexer are the four outputs of the switch, outputs W, X, Y, and Z.
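
The multiplexer and demultiplexer arrangement described above can be sketched in software as follows. This is a minimal illustrative model, not the circuit itself. The class name TileSwitch, its method names, and the use of None for an undriven output are assumptions introduced for the example; only the input names A-D and the output names W-Z come from FIG. 2.

```python
from dataclasses import dataclass

INPUTS = ("A", "B", "C", "D")   # switch inputs named in FIG. 2
OUTPUTS = ("W", "X", "Y", "Z")  # switch outputs named in FIG. 2


@dataclass
class TileSwitch:
    """Hypothetical model of switch 204: a 4:1 multiplexer feeding a 1:4 demultiplexer."""

    selected_input: str = "B"   # assumed default: processing element output of an adjacent tile
    selected_output: str = "X"  # assumed default: input of this tile's processing element

    def configure(self, selected_input: str, selected_output: str) -> None:
        """Set the multiplexer and demultiplexer select lines."""
        if selected_input not in INPUTS:
            raise ValueError(f"unknown switch input {selected_input!r}")
        if selected_output not in OUTPUTS:
            raise ValueError(f"unknown switch output {selected_output!r}")
        self.selected_input = selected_input
        self.selected_output = selected_output

    def route(self, signals: dict) -> dict:
        """Pass the selected input to the selected output; other outputs stay undriven."""
        value = signals[self.selected_input]  # multiplexer stage
        return {                               # demultiplexer stage
            name: (value if name == self.selected_output else None)
            for name in OUTPUTS
        }


# Route data arriving from receiver coil A (input C) to the processing element bypass (output W).
switch = TileSwitch()
switch.configure("C", "W")
routed = switch.route({"A": 1, "B": 2, "C": 3, "D": 4})
```

In this sketch a dataflow configuration amounts to one (input, output) selection per tile, which is why changing the select values is enough to change the path data takes through the stack.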

FIG. 3 is an example block diagram of a bipartite graph. The bipartite graph illustrates dataflow configuration and components of a neural network architecture. The edges connect the input vertices (I1-5) to the output vertices (O1-5) to represent computing tiles. In this example, a computing tile is represented by a particular input vertex and a particular output vertex connected together with an edge. A dataflow configuration between vertices is illustrated with solid edges. For example, an edge that goes from an output vertex to an input vertex illustrates the transmission of data from the output of one tile to the input of another tile. A redundant dataflow configuration is illustrated with dashed edges. The dashed edges represent alternative dataflow paths between the vertices in the instance a tile is deemed faulty and is bypassed.

If a tile is faulty, the corresponding edge between an input vertex and a corresponding output vertex is removed from the graph. This illustrates that there is no data transmitted from the input of the computing tile to the output of the computing tile and the processing element 202 does not execute any computations. If the switch 204 of the computing tile is still functional, the vertices remain in the network graph because the processing element 202 of a tile can still be bypassed using the processing element bypass. Edges from the output vertices to the input vertices represent the possible connections that the switches and the vertical communicative coupling can realize. There will be multiple allowable edges per vertex representing the possible configurations that the switches can be configured to direct inputs to outputs.

To bypass a faulty tile and increase three-dimensionally stacked neural network accelerator yield, a Hamiltonian circuit is applied to the graph. The Hamiltonian circuit can illustrate a ring bus that is a closed tour of data propagations such that each active vertex receives and transmits data exactly once. The Hamiltonian circuit is the maximum length circuit that can be achieved by incorporating each functional vertex. The three-dimensionally stacked neural network accelerator offers more alternate paths for dataflow configurations than a two-dimensional neural network accelerator. Therefore, the probability that an optimal or near optimal configuration (e.g., a Hamiltonian circuit) can be found is higher for the three-dimensionally stacked neural network accelerator than for a two-dimensional neural network accelerator.
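
As a rough illustration of searching for such a closed tour, the following sketch runs a brute-force backtracking search over the functional tiles of a small stack. It rests on several simplifying assumptions that are not taken from this specification: tiles are identified by hypothetical (die, row, column) coordinates, each tile can exchange data only with the tiles immediately beside, above, or below it, and faulty tiles are dropped from the graph entirely (a tile whose processing element has failed but whose switch still works would, per the description above, remain available as a pass-through). The function names are illustrative, and the search strategy is one simple option rather than the method the system necessarily uses.

```python
from itertools import product


def neighbors(tile, functional):
    """Yield functional tiles directly adjacent to `tile` on the same die or on the die above/below."""
    die, row, col = tile
    for d_die, d_row, d_col in ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)):
        candidate = (die + d_die, row + d_row, col + d_col)
        if candidate in functional:
            yield candidate


def find_ring(functional):
    """Return a list of tiles forming a closed tour through every functional tile, or None."""
    functional = set(functional)
    if not functional:
        return None
    start = min(functional)
    path = [start]
    visited = {start}

    def extend():
        if len(path) == len(functional):
            # The tour is closed only if the last tile can send data back to the first.
            return start in neighbors(path[-1], functional)
        for nxt in neighbors(path[-1], functional):
            if nxt not in visited:
                visited.add(nxt)
                path.append(nxt)
                if extend():
                    return True
                visited.remove(nxt)
                path.pop()
        return False

    return path if extend() else None


# Two stacked dies, each with a 2x2 arrangement of tiles; two tiles on die 0 are faulty.
all_tiles = set(product(range(2), range(2), range(2)))
ring = find_ring(all_tiles - {(0, 0, 0), (0, 0, 1)})
no_ring = find_ring(all_tiles - {(0, 0, 0)})
```

With die 0 reduced to two working tiles, no ring could be formed on that die alone, yet the search finds a six-tile ring by using the vertical links to die 1. The second call returns None, which shows that a closed tour is not guaranteed for every fault pattern under these adjacency assumptions.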

FIG. 4 illustrates an example neural network die 400 and a dataflow configuration for the example neural network die's tiles 104 a-p. Generally, however, the tiles 104 a-p can be arranged in any arrangement on the die 400. The tiles 104 a-p are organized in a rectangular arrangement such that tiles located on vertically adjacent dies are configured in the same position. For example, the first tile 104 a on the first die 102 a is located above the first tile 104 a on the second die 102 b, which is located above the first tile 104 a on the third die 102 c, etc. In addition, inputs and outputs of vertically adjacent tiles have a mirrored or rotational symmetry. For example, the inputs and outputs of the first tile 104 a on the first die 102 a are positionally located on the die in the same orientation as the inputs and outputs for the first tile 104 a on the second die 102 b located in the stack above or below the first die 102 a.

Each tile 104 a-p is communicatively coupled with the tile's neighboring tiles on the die 400. In this example, the first tile 104 a is communicatively coupled with the second tile 104 b. However, tiles 104 a-p can be communicatively coupled in any configuration. Tiles 104 a-p on neural network die 102 a can be connected together using wired connections. The wired connections enable transmission of data between each connected tile 104 a-p.

Each tile communicates with one or more adjacent tiles 104 a-p to create a Hamiltonian circuit representation using the tiles 104 a-p. The circuit includes a communication scheme such that there is an uninterrupted flow of tile inputs connected to tile outputs, from the beginning of the ring bus to the end of the ring bus. For example, the tiles 104 a-p are configured such that the input and output of each functional tile within the ring network is connected to another functional tile or external source according to a dataflow configuration. The dataflow configuration describes a path of computational data propagation through the tiles 104 a-p within a three-dimensional neural network architecture.

For example, and referring to FIG. 4, in some implementations, a dataflow configuration 402 may specify that a first tile 104 a, on a die 102 a, receives input data from an external source. In some implementations, the external source can be a tile 104 a-p on a different neural network accelerator die 102 b-e, or some other source that transmits data. The first tile 104 a executes computations using the data and transmits the data to a second tile 104 b. Likewise, the second tile 104 b executes computations using the data, and transmits the data to a third tile 104 c. In this implementation, the process continues along the first row of tiles until the data reaches a fourth tile 104 d. The fourth tile 104 d transmits the data to a fifth tile 104 h.

The process continues until the data reaches tile 104 e. In like manner, data is transmitted to the ninth tile 104 i. The data is propagated across the third row of tiles to the twelfth tile 104 l. The twelfth tile 104 l transmits the data to the thirteenth tile 104 p. The dataflow configuration continues to transmit the data to the sixteenth tile 104 m, where the sixteenth tile 104 m transmits the data to an external source or back to the first tile 104 a. In this implementation, the dataflow configuration for the tiles 104 a-p on the first die 102 a is 104 a-b-c-d-h-g-f-e-i-j-k-l-p-o-n-m. In other implementations, the dataflow configuration can be a different path of data travel through the set of tiles. The dataflow configuration is specified based on which switch input is connected to which tile's output. Because there are a plurality of tiles 104 a-p, each tile with a respective output, and because the switch's input can be varied to receive different tiles' outputs, many different dataflow configurations can be achieved.
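
The row-by-row path described above, in which the direction of travel reverses on every other row, can be generated programmatically. The following is a minimal sketch that assumes the tiles are lettered row by row from left to right (a-d in the first row, e-h in the second, and so on), which is consistent with the sequence given above; the function name serpentine_order is hypothetical.

```python
import string


def serpentine_order(rows: int, cols: int) -> list:
    """Return tile letters in a path that reverses direction on every other row."""
    labels = string.ascii_lowercase  # assumed row-major lettering of the tiles
    if rows * cols > len(labels):
        raise ValueError("this sketch only labels up to 26 tiles")
    order = []
    for row in range(rows):
        columns = range(cols) if row % 2 == 0 else reversed(range(cols))
        order.extend(labels[row * cols + col] for col in columns)
    return order


path = "-".join(serpentine_order(4, 4))
print(path)  # a-b-c-d-h-g-f-e-i-j-k-l-p-o-n-m
```

For a 4-by-4 arrangement the output matches the sequence 104 a-b-c-d-h-g-f-e-i-j-k-l-p-o-n-m, and the last tile in the path (104 m) sits in the same column as the first (104 a).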

In some instances, some tiles 104 a-p may be faulty after production of the three-dimensionally stacked neural network accelerator 100. Post-production tests are executed after stacking the dies to identify which tiles are faulty. The identified faulty tiles are bypassed, and a dataflow configuration is established that eliminates the use of the faulty tiles. The tile configuration includes a redundant data path 304 that can be implemented to bypass faulty tiles 104 a-p. Eliminating a faulty tile(s) can include directing computational data to every other tile in the three-dimensionally stacked neural network accelerator except the identified faulty tiles. Each other tile will perform its designated computational functions as part of performing the computation of the neural network. In this instance, the other tiles can collectively perform the computations of the identified faulty tiles or one tile can be dedicated to perform the computations of the identified faulty tile.

FIG. 5 illustrates an example dataflow configuration 500 for a three-dimensionally stacked neural network accelerator. In this example, the neural network accelerator includes two neural network dies, a top die 102 a and a bottom die 102 b arranged using a pair of inductively coupled connections to aggregate two rings together to form one ring of 16 tiles distributed over two chips stacked vertically. In some implementations, the three-dimensionally stacked neural network accelerator includes more than two dies stacked together (e.g., 3-10 dies). Each die includes a plurality of tiles 104 a-h. The tiles 104 a-h on both dies process data to perform neural network computations according to the dataflow configuration of data propagation through the tiles 104 a-h.

In this example, tile 104 f on the top die 102 a is a total die failure. A total die failure occurs when the bypass fails or the switch fails and the processing unit fails. In a two-dimensional neural network accelerator example with one die having a failure scenario consistent with the top die 102 a illustrated in FIG. 5, the entire die would be a total failure because the dataflow configuration could not create a continuous dataflow path around the die. Because tile 104 f on the top die 102 a is a total die failure, neighboring tiles cannot output data to tile 104 f because there is no way for tile 104 f to output data to another tile 104 a-h. Thus, as illustrated in FIG. 5, no tiles output data to the input of tile 104 f and tile 104 f is completely bypassed by transmitting data to tiles adjacent and vertically adjacent to tile 104 f.

Tile 104 g on the top die 102 a and tile 104 c on the bottom die 102 b are partial failures. A partial failure occurs where a processing element 202 fails, but the switch and data path are still functional. A tile that is experiencing a partial failure can still receive input data because the tile's switch 204 is still functional. Therefore, the partially failed tile's switch 204 can output data to different tiles 104 a-h using the transmitting inductive coils or output data using the processing element bypass 208. As illustrated in FIG. 5, tile 104 g on the top die 102 a and tile 104 c on the bottom die 102 b utilize the transmitting inductive coils to output data to the vertically adjacent tiles. However, in other implementations, either or both of tile 104 g on the top die 102 a and tile 104 c on the bottom die 102 b could utilize the processing element bypass 208 to output data to adjacent tiles 104 a-h.

FIG. 6 is an example flow chart of a process executed by a system of one or more computers for modifying a dataflow configuration for tiles within a three-dimensionally stacked neural network accelerator 600. Aspects of FIG. 6 will be discussed in connection with FIG. 5. For example, the process can be performed by a host computer to which the accelerator is coupled.

Three-dimensionally stacked neural network accelerator tiles are tested to determine functionality of each of the plurality of tiles on the neural network wafers and identify faulty tiles. Techniques for die testing include using out-of-band control channels, for example, a Joint Test Action Group (JTAG) scan chain, to test each of the tiles within the three-dimensionally stacked neural network accelerator. The test identifies which tiles are faulty and are to be bypassed to create the Hamiltonian circuit. In some implementations, the system is configured to determine the Hamiltonian circuit based on the arrangement of functional tiles 104 a-h within the three-dimensionally stacked neural network accelerator. In addition, the system can implement the dataflow configuration using the remaining functional tiles 104 a-h of the three-dimensional neural network accelerator according to the determined Hamiltonian circuit.

The testing can occur prior to or after stacking the neural network wafers 102 a-b. For example, testing can occur prior to cutting the larger fabricated wafers into smaller dies designed for the three-dimensionally stacked neural network accelerators. In this instance, each tile on the larger fabricated wafer is tested for functionality. Alternatively, after cutting larger fabricated wafers into dies designed for the three-dimensionally stacked neural network accelerators, the dies are stacked together to create the three-dimensionally stacked neural network accelerator, and each tile is tested for functionality. In either instance, the tiles 104 a-p are tested prior to executing computations on the three-dimensionally stacked neural network accelerator.

In other implementations, three-dimensionally stacked neural network accelerators are constantly analyzed during operation of the neural network accelerator to identify tiles that may have been operational during the initial functional testing but have since failed or become faulty.

Referring to FIG. 6, the process includes determining that a tile from a plurality of tiles in a three-dimensionally stacked neural network accelerator is a faulty tile (402). A faulty tile is a tile that, based on the analyzing, does not function as designed. As previously described, the three-dimensionally stacked neural network accelerator comprises a plurality of neural network accelerator dies 102 a-e stacked on top of each other, and each neural network accelerator die includes a respective plurality of tiles 104 a-p. Each tile has a plurality of input and output connections that transmit data into and out of the tile. The data is used to execute neural network computations.

Referring back to FIG. 4, in the illustrated example 300, the sixth and seventh tiles 104 f and 104 g on the top die 102 a and the third tile 104 c on the bottom die 102 b have been determined to be faulty. Faulty tiles 104 f and 104 g on the top die 102 a and 104 c on the bottom die 102 b are bypassed and removed from the dataflow configuration. Removing the faulty tiles includes modifying the dataflow configuration to transmit an output of a tile before the faulty tile in the dataflow configuration to an input connection of a tile that is positioned above or below the faulty tile on a different neural network die than the faulty tile (404).
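As a minimal sketch of the modification step (404), the dataflow configuration can be pictured as a table that records, for each tile, which other tile's output it selects as its input. The table layout, the (die, position) tile encoding, and the function name below are hypothetical and are used only for illustration. The final loop, which points tiles that previously consumed the faulty tile's output at the substitute tile, mirrors the optional routing recited in claim 19.

```python
from typing import Dict, Set, Tuple

Tile = Tuple[int, int]  # (die index, position on the die); assumed encoding


def reroute_around_fault(
    input_select: Dict[Tile, Tile],
    faulty: Tile,
    known_faulty: Set[Tile],
    num_dies: int,
) -> Tuple[Dict[Tile, Tile], Tile]:
    """Return a new input-selection table in which the output that used to
    feed `faulty` feeds the tile directly above or below it instead.

    `input_select[consumer]` names the tile whose output `consumer` reads.
    """
    die, position = faulty
    predecessor = input_select[faulty]

    # Candidate substitutes: same position, one die up or one die down.
    candidates = [(d, position) for d in (die - 1, die + 1) if 0 <= d < num_dies]
    for target in candidates:
        if target not in known_faulty:
            break
    else:
        raise ValueError(f"no functional tile above or below {faulty}")

    updated = dict(input_select)   # leave the caller's table untouched
    del updated[faulty]            # the faulty tile no longer consumes data
    updated[target] = predecessor  # substitute now reads the predecessor

    # Tiles that used to read the faulty tile now read the substitute.
    for consumer, source in input_select.items():
        if source == faulty and consumer != target:
            updated[consumer] = target
    return updated, target


# Tile (0, 1) on die 0 is faulty; tile (1, 1) sits directly below it.
table = {(0, 1): (0, 0), (0, 2): (0, 1), (1, 1): (1, 0)}
new_table, substitute = reroute_around_fault(table, (0, 1), {(0, 1)}, num_dies=2)
```

Returning a fresh table rather than mutating the original keeps the initial dataflow configuration available for comparison, which is a design choice of this sketch rather than something the specification requires.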

In some implementations, bypassing the faulty tile includes removing power that is provided to the faulty tile 104 c. Power can be removed from the faulty tile 104 c by disconnecting a switch that provides power to the faulty tile or using programming logic to remove power provided to the tile. Removing power provided to the faulty tile 104 c ensures that the faulty tile is not operational and the faulty tile 104 c does not draw unnecessary power from a power source providing power to the three-dimensionally stacked neural network accelerator. Further, data is transmitted either around the faulty tile or through the faulty tile, but not to the processing element 202 of the faulty tile.

In other implementations, bypassing the faulty tile 104 c can include turning off a clock that is unique to the faulty tile. The faulty tile's clock can be disabled using programming logic or by physically removing the clock from the circuit by disconnecting the clock's output. Turning off the faulty tile's clock stops the faulty tile from executing processing functions, thereby deactivating the faulty tile and ceasing the faulty tile's operation.
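The two disabling options described above, removing power and turning off the tile's clock, can be pictured as per-tile control state held by the host. The following is only a sketch under assumed names: `TileControl`, `DisableMode`, and `disable_tile` are hypothetical and do not correspond to any register layout or interface given in this specification.

```python
from dataclasses import dataclass
from enum import Enum


class DisableMode(Enum):
    """How a faulty tile is taken out of service (assumed names)."""
    POWER_GATE = "power_gate"  # remove power provided to the tile
    CLOCK_GATE = "clock_gate"  # turn off the clock unique to the tile


@dataclass
class TileControl:
    """Hypothetical per-tile control state tracked by the host."""
    power_on: bool = True
    clock_on: bool = True
    bypass_processing_element: bool = False

    @property
    def can_compute(self) -> bool:
        # A tile computes only if it is powered, clocked, and not bypassed.
        return self.power_on and self.clock_on and not self.bypass_processing_element


def disable_tile(ctrl: TileControl, mode: DisableMode) -> TileControl:
    """Mark the processing element as bypassed, then apply the chosen mode."""
    ctrl.bypass_processing_element = True
    if mode is DisableMode.POWER_GATE:
        ctrl.power_on = False
    elif mode is DisableMode.CLOCK_GATE:
        ctrl.clock_on = False
    return ctrl


gated = disable_tile(TileControl(), DisableMode.CLOCK_GATE)
```

Setting the bypass flag in both modes reflects the statement above that data may still pass through a faulty tile but never reaches its processing element.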

According to the example illustrated in FIG. 5, the faulty tiles are bypassed and the dataflow configuration is modified according to a Hamiltonian circuit representation of the tile dataflow configuration. For example, one implementation can include having the first tile 104 a on the top die 102 a receive input data from an external source (1). The first tile 104 a processes the data, and the first tile 104 a transmits output data to the switch of the second tile 104 b. The switch 202 of the second tile 104 b of the top die 102 a directs the data to the transmitting inductive coil (T) and transmits the data to the second tile 104 b of the bottom die 102 b. For ease of illustration, the inductive coils have been presented as single coils; however, as previously mentioned, each of the inductive coils is a plurality of coils communicatively coupled to transmit and receive data from other coils on vertically adjacent tiles.

The second tile 104 b of the bottom die 102 b processes the data (2) and transmits the data to the third tile 104 c of the bottom die 102 b. Since the third tile 104 c of the bottom die 102 b was determined to be faulty, the processing element is bypassed and the data is sent to the transmitting inductive coil (T), which transmits the data using inductive coupling to the receiving inductive coil (R) of the third tile 104 c of the top die 102 a. The processing element of that tile processes the data (3) and transmits the data to the switch 202 of the fourth tile 104 d of the top die 102 a. The switch transmits the data to the transmitting inductive coil (T) of the fourth tile 104 d of the top die 102 a, which transmits the data to the fourth tile 104 d of the bottom die 102 b. The fourth tile 104 d of the bottom die 102 b processes the data (4) and the output is transmitted to the eighth tile 104 h of the bottom die 102 b.

The processing element of the eighth tile 104 h processes the data (5) and the output is transmitted to the switch of the seventh tile 104 g on the bottom die 102 b. The seventh tile 104 g processes the data (6) and the output is transmitted to the switch of the sixth tile 104 f of the bottom die 102 b. The sixth tile 104 f processes the data (7) and the output is transmitted to the switch of the fifth tile 104 e on the bottom die 102 b. The fifth tile's processing element processes the data (8) and the output is transmitted to the first tile 104 a of the bottom die 102 b. The first tile 104 a processes the data (9) and the output is transmitted to the switch of the second tile 104 b of the bottom die 102 b.

The second tile's switch transmits the data to the second tile's transmitting inductive coil (T), which transmits the data to the receiving inductive coil (R) of the second tile 104 b of the top die 102 a. The second tile's processing element processes the data (10) and the output is transmitted to the switch of the third tile 104 c of the top die 102 a. The data is transmitted using the processing element bypass to bypass the third tile's processing element, and the data is transmitted to the switch of the fourth tile 104 d of the top die 102 a. The fourth tile 104 d processes the data (11) and transmits the data to the eighth tile 104 h of the top die 102 a.

The eighth tile 104 h processes the data (12) and the output is transmitted to the switch of the seventh tile 104 g of the top die 102 a. As previously described, the seventh tile 104 g of the top die was determined to be faulty during testing. Therefore, the seventh tile's switch transmits the data to the transmitting inductive coil (T) of the seventh tile 104 g. The transmitting inductive coil (T) directs the data to the receiving inductive coil (R) of the seventh tile 104 g of the bottom die 102 b. The seventh tile's switch directs the data to the sixth tile 104 f of the bottom die 102 b using the processing element bypass. The switch of the sixth tile 104 f transmits the data to the switch of the fifth tile 104 e of the bottom die 102 b using the sixth tile's processing element bypass.

The fifth tile 104 e directs the data to the transmitting inductive coil (T) of the fifth tile 104 e, which transmits the data to the receiving inductive coil (R) of the fifth tile 104 e of the top die 102 a. The fifth tile 104 e processes the data (13) and the output is either directed back to the first tile 104 a of the top die 102 a or directed to an external device. During testing of the three-dimensionally stacked neural network accelerator, the sixth tile 104 f of the top die 102 a was identified as faulty. In this example, no data was routed to the sixth tile 104 f and the sixth tile 104 f was completely bypassed.
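The route walked through above can be written down and checked mechanically. The sketch below transcribes each hop of the FIG. 5 example as either a "process" step or a "pass" step (data forwarded by a switch, coil, or processing element bypass). The tile layout is an assumption inferred from the hops in the text, with tiles a-d on one row of each die and tiles e-h on the next; the names `ROUTE`, `POSITION`, and `check_route` are hypothetical.

```python
from collections import Counter

# Assumed layout per die: a b c d on row 0, e f g h on row 1.
POSITION = {name: divmod(i, 4) for i, name in enumerate("abcdefgh")}
FAULTY = {("top", "f"), ("top", "g"), ("bottom", "c")}

# Hops transcribed from the walkthrough: (die, tile, action).
ROUTE = [
    ("top", "a", "process"), ("top", "b", "pass"),
    ("bottom", "b", "process"), ("bottom", "c", "pass"),
    ("top", "c", "process"), ("top", "d", "pass"),
    ("bottom", "d", "process"), ("bottom", "h", "process"),
    ("bottom", "g", "process"), ("bottom", "f", "process"),
    ("bottom", "e", "process"), ("bottom", "a", "process"),
    ("bottom", "b", "pass"), ("top", "b", "process"),
    ("top", "c", "pass"), ("top", "d", "process"),
    ("top", "h", "process"), ("top", "g", "pass"),
    ("bottom", "g", "pass"), ("bottom", "f", "pass"),
    ("bottom", "e", "pass"), ("top", "e", "process"),
]


def is_neighbour(src, dst):
    """True if dst is next to src on the same die, or is the same-position
    tile on the other die (the inductively coupled link)."""
    (die_a, tile_a), (die_b, tile_b) = src, dst
    if die_a != die_b:
        return tile_a == tile_b
    (r1, c1), (r2, c2) = POSITION[tile_a], POSITION[tile_b]
    return abs(r1 - r2) + abs(c1 - c2) == 1


def check_route(route, faulty):
    """Check the properties the walkthrough relies on."""
    processed = Counter((d, t) for d, t, action in route if action == "process")
    functional = {(d, t) for d in ("top", "bottom") for t in POSITION} - faulty
    hops = [(d, t) for d, t, _ in route]
    return {
        "each_functional_tile_processes_once":
            set(processed) == functional and all(n == 1 for n in processed.values()),
        "no_faulty_tile_processes": not (set(processed) & faulty),
        "all_hops_adjacent": all(is_neighbour(a, b) for a, b in zip(hops, hops[1:])),
    }


result = check_route(ROUTE, FAULTY)
```

Under the assumed layout all three checks hold: the thirteen functional tiles each process the data once, the three faulty tiles never process it, and every hop is either between adjacent tiles on one die or between vertically aligned tiles on the two dies.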

Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively, or in addition, the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, which is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.

The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output(s). The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array), an ASIC (application specific integrated circuit), or a GPGPU (general purpose graphics processing unit).

Computers suitable for the execution of a computer program can, by way of example, be based on general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic disks, magneto-optical disks, or optical disks. However, a computer need not have such devices.

While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain implementations, multitasking and parallel processing may be advantageous.

What is claimed is:
 1. A circuit, comprising: a plurality of dies vertically stacked on top of each other, each die including a plurality of tiles, each of the plurality of tiles comprising a multiplexer, wherein the multiplexer of each tile is connected to outputs of a plurality of other tiles in the plurality of tiles and controls which output of the outputs of the plurality of other tiles the tile receives as an input; wherein the plurality of tiles includes a first tile on a first die that has been determined to be a faulty tile; and wherein a dataflow configuration is configured to route the output of a tile on the first die prior to the faulty tile on the first die in an initial dataflow configuration to an input of a tile on a second different die and vertically adjacent to the faulty tile on the first die instead of to an input of the faulty tile on the first die.
 2. The circuit of claim 1, further comprising inductive coupling between tiles on different dies enabling communication between the tiles on different dies.
 3. The circuit of claim 1, further comprising, for each multiplexer, a respective controller coupled to the multiplexer, wherein the controller is configured to transmit instructions to the multiplexer to designate an active input for the multiplexer.
 4. The circuit of claim 1, further comprising, for each wafer, a respective controller coupled to each multiplexer included on the wafer, wherein the controller is configured to transmit instructions to each multiplexer included on the wafer to designate an active input for each multiplexer.
 5. The circuit of claim 1, further comprising, a controller coupled to each multiplexer, wherein the controller is configured to transmit instructions to each multiplexer to designate an active input for each multiplexer.
 6. The circuit of claim 1, wherein to define the dataflow configuration to route the output of the tile on the first die prior to the faulty tile on the first die in the initial dataflow configuration to an input of a tile on the second different die and vertically adjacent to the faulty tile on the first die, the input of the multiplexer of the tile on the second different die and vertically adjacent to the faulty tile on the first die is set to receive the output of the tile on the first die prior to the faulty tile on the first die in the initial dataflow configuration.
 7. The circuit of claim 1, wherein the faulty tile has been disabled.
 8. The circuit of claim 7, wherein disabling the faulty tile comprises removing power that is distributed to the faulty tile.
 9. The circuit of claim 7, wherein disabling the faulty tile comprises turning off a clock that is unique to the faulty tile.
 10. The circuit of claim 1, wherein each tile communicates with one or more adjacent tiles to create a ring network of tiles, wherein the ring network is configured such that each functional tile within the ring network receives and transfers data.
 11. A method, comprising: obtaining data specifying that a tile from a plurality of tiles in a three-dimensionally stacked neural network accelerator is a faulty tile, wherein: the three-dimensionally stacked neural network accelerator comprises a plurality of neural network dies stacked on top of each other, each neural network die including a respective plurality of tiles, each tile has input and output connections that route data into and out of the tile, and the three-dimensionally stacked neural network accelerator is configured to process inputs by routing the input through each of the plurality of tiles according to a dataflow configuration; and modifying the dataflow configuration to route an output of a tile on a first neural network die before the faulty tile on the first neural network die in the dataflow configuration to an input connection of a tile that is positioned above or below the faulty tile on a second different neural network die than the faulty tile on the first neural network die.
 12. The method of claim 11, wherein the neural network accelerator further comprises a respective multiplexer for each of the plurality of tiles that control routing of data between tiles, and wherein modifying the dataflow configuration comprises configuring the multiplexer for the tile that is positioned above or below the faulty tile on the second different neural network die than the faulty tile to cause the output of the tile on the first neural network die before the faulty tile on the first neural network die to be routed to the input connection of the tile on the second different neural network die above or below the faulty tile.
 13. The method of claim 11, further comprising disabling the faulty tile.
 14. The method of claim 13, wherein disabling the faulty tile comprises removing power that is distributed to the faulty tile.
 15. The method of claim 13, wherein disabling the faulty tile comprises turning off a clock that is unique to the faulty tile.
 16. The method of claim 11, further comprising analyzing functionality of each of the plurality of tiles on the neural network dies to determine that the tile is a faulty tile.
 17. The method of claim 16, wherein determining the tile is faulty comprises determining that the tile does not function as designed based on the analyzing.
 18. The method of claim 11, wherein each tile communicates with adjacent tiles above or below the tile using inductive coupling.
 19. The method of claim 11, further comprising, routing the output of the tile above or below the faulty tile to an input of a tile after the faulty tile in the dataflow configuration.
 20. The method of claim 11, wherein each tile communicates with one or more adjacent tiles to create a ring network of tiles, and wherein the ring network is configured such that each functional tile within the ring network receives and transfers data.